Conversation
📝 Walkthrough

Adds a strategy-based restart flow for config-triggered Elasticsearch restarts: new role defaults and handlers, extracted tasks for upgrade detection and cluster settings, per-node rolling restart tasks with allocation/flush/health gating, a fake ES HTTP server, a Molecule side-effect play, and multiple integration contract tests and inventories.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Handler as Restart Handler
participant Orchestrator as Rolling Orchestrator
participant NodeA as ES Node A
participant NodeB as ES Node B
participant Cluster as Cluster API / Master
participant FakeAPI as Fake ES API
Handler->>Orchestrator: Trigger restart (strategy=rolling)
Orchestrator->>Orchestrator: validate strategy & credentials
Orchestrator->>Cluster: PUT _cluster/settings (disable allocation)
Cluster-->>Orchestrator: acknowledged
Orchestrator->>NodeA: (optional) POST /_flush
NodeA-->>Orchestrator: flush response
Orchestrator->>NodeA: systemctl restart elasticsearch
NodeA-->>Orchestrator: service restarted
Orchestrator->>FakeAPI: poll /_cat/nodes until NodeA present
FakeAPI-->>Orchestrator: node present
Orchestrator->>Cluster: PUT _cluster/settings (enable allocation)
Cluster-->>Orchestrator: acknowledged
Orchestrator->>Cluster: wait for health (yellow/green)
Cluster-->>Orchestrator: health ok
Note right of Orchestrator: repeat for NodeB
Orchestrator-->>Handler: Rolling restart complete
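For orientation, a minimal Ansible sketch of one iteration of this loop follows. It is illustrative only: names such as `restart_host` and `cluster_api_host` are placeholders rather than the role's actual `_elasticsearch_*` variables, and the protocol, port, credentials, and retry values are assumptions.

- name: Disable shard allocation before restarting the node
  ansible.builtin.uri:
    url: "http://{{ cluster_api_host }}:9200/_cluster/settings"
    method: PUT
    body_format: json
    body:
      persistent:
        cluster.routing.allocation.enable: "primaries"

- name: Flush indices so recovery after the restart is faster (optional step)
  ansible.builtin.uri:
    url: "http://{{ restart_host }}:9200/_flush"
    method: POST
  failed_when: false

- name: Restart Elasticsearch on the target node
  ansible.builtin.systemd:
    name: elasticsearch
    state: restarted
  delegate_to: "{{ restart_host }}"

- name: Wait until the restarted node is listed in _cat/nodes again
  ansible.builtin.uri:
    url: "http://{{ cluster_api_host }}:9200/_cat/nodes?h=name"
    return_content: true
  register: cat_nodes
  until: restart_host in cat_nodes.content
  retries: 30
  delay: 10

- name: Re-enable shard allocation
  ansible.builtin.uri:
    url: "http://{{ cluster_api_host }}:9200/_cluster/settings"
    method: PUT
    body_format: json
    body:
      persistent:
        cluster.routing.allocation.enable: null

- name: Wait for cluster health before moving on to the next node
  ansible.builtin.uri:
    url: "http://{{ cluster_api_host }}:9200/_cluster/health"
    return_content: true
  register: cluster_health
  until: (cluster_health.json.status | default('')) in ['green', 'yellow']
  retries: 60
  delay: 10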
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 6
🧹 Nitpick comments (5)
roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling.yml (1)
34-39: Redundant `when` condition.

The `when: _elasticsearch_restart_hosts | length > 0` check on line 39 is unnecessary—looping over an empty list naturally results in zero iterations. Removing it simplifies the code.

Optional simplification
  - name: restart_and_verify_elasticsearch_rolling | Restart requested nodes one at a time
    ansible.builtin.include_tasks: restart_and_verify_elasticsearch_rolling_node.yml
    loop: "{{ _elasticsearch_restart_hosts }}"
    loop_control:
      loop_var: _elasticsearch_restart_host
-   when: _elasticsearch_restart_hosts | length > 0

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling.yml` around lines 34 - 39, The task "restart_and_verify_elasticsearch_rolling | Restart requested nodes one at a time" redundantly uses "when: _elasticsearch_restart_hosts | length > 0" before looping; remove that when condition and rely on looping over "loop: {{ _elasticsearch_restart_hosts }}" (with loop_control.loop_var: _elasticsearch_restart_host) so the include_tasks call to restart_and_verify_elasticsearch_rolling_node.yml naturally no-ops for an empty list.
.gitignore (1)

10-18: Redundant gitignore rule on line 10.

Line 10 (`tests/integration/`) ignores the directory itself, but line 11 immediately negates it with `!tests/integration/`. Since line 12 (`tests/integration/*`) already ignores the directory contents, line 10 is redundant and creates confusion.

Consider removing line 10 if you only want to ignore contents while tracking specific files:
Proposed simplification
-tests/integration/
-!tests/integration/
 tests/integration/*
 !tests/integration/elasticsearch_cluster_settings_contract.yml
 !tests/integration/elasticsearch_template_contract.yml

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gitignore around lines 10 - 18, Remove the redundant ignore rule "tests/integration/"; keep "tests/integration/*" and the explicit negations (e.g., "!tests/integration/elasticsearch_cluster_settings_contract.yml", etc.) so the directory is not ignored but its contents are selectively ignored/tracked—delete the lone "tests/integration/" entry from the .gitignore.
tests/fakes/fake_es_rolling_api.py (1)

113-128: Consider adding graceful shutdown capability.

The server blocks indefinitely with `threading.Event().wait()`, which works for the test scenario (killed via PID) but doesn't support graceful shutdown. This is acceptable for test infrastructure, but a signal handler could make debugging easier.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/fakes/fake_es_rolling_api.py` around lines 113 - 128, The main() function currently blocks forever with threading.Event().wait(), preventing graceful shutdown; modify main to create a shared threading.Event (e.g., stop_event), pass it into serve (or make it accessible to State/serve), and install signal handlers for SIGINT/SIGTERM that set stop_event so threads can exit cleanly; after starting threads, wait on stop_event and then join threads (or allow serve to exit when stop_event.is_set()), ensuring State/serve respect the stop_event for orderly teardown.
molecule/elasticsearch_default/side_effect.yml (1)

23-26: Add error handling for PID capture.

If Elasticsearch isn't running when this task executes, `pgrep` returns non-zero and the task fails. Consider adding `failed_when: false` or checking the return code, especially since this runs at the start of the side effect test.

Proposed fix
  - name: Record ES PID before config change
    ansible.builtin.command: pgrep -f 'org.elasticsearch.bootstrap.Elasticsearch'
    register: _es_pid_before_config_restart
    changed_when: false
+   failed_when: false

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@molecule/elasticsearch_default/side_effect.yml` around lines 23 - 26, The "Record ES PID before config change" task using the command pgrep -f 'org.elasticsearch.bootstrap.Elasticsearch' can fail if Elasticsearch isn't running; update this task (the one that registers _es_pid_before_config_restart) to avoid failing the play by adding explicit error handling such as failed_when: false (or ignore_errors: true) and keep changed_when: false so the play continues when pgrep returns non-zero; ensure the task still registers _es_pid_before_config_restart and downstream tasks check its .rc or stdout to detect absence of a PID.
tests/integration/rolling_restart_contract.yml (1)

88-94: Assertion patterns may not match JSONL key order.

The assertions at lines 91-92 search for `'"path": "/_flush", "port": 19200'`. However, looking at the fake API's `log()` method (context snippet 2), keys are sorted alphabetically (`sort_keys=True`), so the actual order is: body, method, path, port. The pattern should work since both `"path"` and `"port"` appear consecutively in sorted order.

Lines 93-94 assert `"primaries"` and `"null"` appear in the log; these likely come from the allocation settings PUT body. Consider adding comments explaining what these assertions verify.

Clarify assertion intent
  - name: Assert both fake nodes were processed with flush
    ansible.builtin.assert:
      that:
        - '_fake_api_log.stdout is search(''"path": "/_flush", "port": 19200'')'
        - '_fake_api_log.stdout is search(''"path": "/_flush", "port": 19201'')'
+       # Verify allocation settings were applied (primaries allocation, null to reset)
        - "_fake_api_log.stdout is search('primaries')"
        - "_fake_api_log.stdout is search('null')"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/rolling_restart_contract.yml` around lines 88 - 94, The current assertions in tests/integration/rolling_restart_contract.yml check literal key order in _fake_api_log.stdout (e.g., '"path": "/_flush", "port": 19200') which is fragile given the fake API's log() uses sort_keys=True; update the test to either use order-independent checks (e.g., separate contains checks for '"path": "/_flush"' and '"port": 19200' or a regex that allows any ordering) or keep the existing checks but add a clear comment referencing the fake API's log() sort_keys=True behavior and that the assertion expects alphabetic ordering (and also clarify what the '"primaries"' and '"null"' assertions validate) so intent is explicit; target the assertions on _fake_api_log and mention log()/sort_keys=True in the comment or refactor the two lines that combine path and port into independent assertions.
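If the test is refactored toward order-independent checks, one possible shape is sketched below. It assumes the fake logger writes one JSON object per line and that the captured log is registered as `_fake_api_log`, both taken from the comment above; the task names and the `_flush_ports` fact are illustrative.

- name: Extract flush requests from the fake API log (assumes one JSON object per line)
  ansible.builtin.set_fact:
    _flush_ports: >-
      {{ _fake_api_log.stdout_lines
         | map('from_json')
         | selectattr('path', 'equalto', '/_flush')
         | map(attribute='port')
         | map('int')
         | list }}

- name: Assert both fake nodes received a flush, independent of key order in the log
  ansible.builtin.assert:
    that:
      - 19200 in _flush_ports
      - 19201 in _flush_ports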
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@molecule/elasticsearch_default/molecule.yml`:
- Line 24: The ANSIBLE_COLLECTIONS_PATH currently contains a hardcoded,
user-specific path "/home/sam/.ansible/collections"; update the
ANSIBLE_COLLECTIONS_PATH value to avoid a fixed user home and use a portable
variable such as ${HOME} (or fallback env var) instead — e.g. set
ANSIBLE_COLLECTIONS_PATH to
"${MOLECULE_PROJECT_DIRECTORY}/.ansible/collections:${HOME}/.ansible/collections:/usr/share/ansible/collections"
or use an env-default pattern like
"${ANSIBLE_COLLECTIONS_PATH:-${MOLECULE_PROJECT_DIRECTORY}/.ansible/collections:${HOME}/.ansible/collections:/usr/share/ansible/collections}"
so contributors' home directories are resolved dynamically.
In `@roles/elasticsearch/handlers/main.yml`:
- Around line 3-12: The when conditions in the handler reference registered
variables that may be undefined (_elasticsearch_freshstart,
_elasticsearch_freshstart_security, and similar vars at the other locations),
causing failures when accessing .changed; update each when clause to guard those
accesses by using a safe default or defined check (e.g. replace occurrences like
"not _elasticsearch_freshstart.changed | bool" with "not
(_elasticsearch_freshstart.changed | default(false) | bool)" or include
"_elasticsearch_freshstart is defined and ..." ), and apply the same pattern to
the other referenced vars mentioned (the checks around
_elasticsearch_freshstart_security and the other occurrences on lines 23-24,
37-38, and 50-51) so .changed is never accessed on an undefined variable.
In `@roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml`:
- Around line 3-8: The task sets _elasticsearch_effective_cluster_settings using
elasticsearch_logsdb without a default which can raise an undefined variable
error; update the expression in the ansible.builtin.set_fact that builds
_elasticsearch_effective_cluster_settings to guard elasticsearch_logsdb with a
default (e.g., use elasticsearch_logsdb | default(false) | bool) so the ternary
evaluation always has a boolean value, keeping the existing
combine(elasticsearch_cluster_settings | default({})) logic intact.
In `@roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml`:
- Around line 14-23: The task "elasticsearch-upgrade-detection" sets
_elasticsearch_needs_rolling_upgrade but uses elasticstack_release without a
defined check; update the Jinja2 expression in the ansible.builtin.set_fact to
guard uses of elasticstack_release (e.g., add an "elasticstack_release is
defined and" before the elasticstack_release | int comparison) so the whole OR
clause only evaluates that comparison when elasticstack_release exists, leaving
the rest of the logic intact in the _elasticsearch_needs_rolling_upgrade
assignment.
In `@roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml`:
- Line 16: The current conditional for _elasticsearch_health_statuses uses
elasticsearch_config_restart_wait_status and may allow 'yellow' for the
post-restart gate; change the logic so that when re-enabling shard allocation
after a node restart the expected health is strictly ['green'] (i.e. set
_elasticsearch_health_statuses = ['green'] for the post-reenable step regardless
of elasticsearch_config_restart_wait_status), update the same logic sites that
reuse _elasticsearch_health_statuses (references: _elasticsearch_health_statuses
and elasticsearch_config_restart_wait_status) so the gate that runs after
re-enabling allocation always waits for green before proceeding.
- Around line 225-261: The two cluster-wide API calls in the tasks named
"restart_and_verify_elasticsearch_rolling_node | Re-enable shard allocation" and
"restart_and_verify_elasticsearch_rolling_node | Wait for cluster health after
restart" currently delegate to _elasticsearch_restart_host (the node being
restarted) — change both to delegate to a stable API host variable instead (e.g.
_elasticsearch_cluster_api_host) and ensure that variable is computed to exclude
the restart target (for example: set _elasticsearch_cluster_api_host to
(groups['elasticsearch'] | difference([_elasticsearch_restart_host]) | first) or
a manually chosen admin node); update the tasks' delegate_to to "{{
_elasticsearch_cluster_api_host | default(_elasticsearch_restart_host) }}" so
the cleanup and health checks run via a known-live cluster node, and add the
variable resolution earlier in the playbook or role to guarantee a fallback if
no alternate host exists.
---
Nitpick comments:
In @.gitignore:
- Around line 10-18: Remove the redundant ignore rule "tests/integration/"; keep
"tests/integration/*" and the explicit negations (e.g.,
"!tests/integration/elasticsearch_cluster_settings_contract.yml", etc.) so the
directory is not ignored but its contents are selectively ignored/tracked—delete
the lone "tests/integration/" entry from the .gitignore.
In `@molecule/elasticsearch_default/side_effect.yml`:
- Around line 23-26: The "Record ES PID before config change" task using the
command pgrep -f 'org.elasticsearch.bootstrap.Elasticsearch' can fail if
Elasticsearch isn't running; update this task (the one that registers
_es_pid_before_config_restart) to avoid failing the play by adding explicit
error handling such as failed_when: false (or ignore_errors: true) and keep
changed_when: false so the play continues when pgrep returns non-zero; ensure
the task still registers _es_pid_before_config_restart and downstream tasks
check its .rc or stdout to detect absence of a PID.
In `@roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling.yml`:
- Around line 34-39: The task "restart_and_verify_elasticsearch_rolling |
Restart requested nodes one at a time" redundantly uses "when:
_elasticsearch_restart_hosts | length > 0" before looping; remove that when
condition and rely on looping over "loop: {{ _elasticsearch_restart_hosts }}"
(with loop_control.loop_var: _elasticsearch_restart_host) so the include_tasks
call to restart_and_verify_elasticsearch_rolling_node.yml naturally no-ops for
an empty list.
In `@tests/fakes/fake_es_rolling_api.py`:
- Around line 113-128: The main() function currently blocks forever with
threading.Event().wait(), preventing graceful shutdown; modify main to create a
shared threading.Event (e.g., stop_event), pass it into serve (or make it
accessible to State/serve), and install signal handlers for SIGINT/SIGTERM that
set stop_event so threads can exit cleanly; after starting threads, wait on
stop_event and then join threads (or allow serve to exit when
stop_event.is_set()), ensuring State/serve respect the stop_event for orderly
teardown.
In `@tests/integration/rolling_restart_contract.yml`:
- Around line 88-94: The current assertions in
tests/integration/rolling_restart_contract.yml check literal key order in
_fake_api_log.stdout (e.g., '"path": "/_flush", "port": 19200') which is fragile
given the fake API's log() uses sort_keys=True; update the test to either use
order-independent checks (e.g., separate contains checks for '"path": "/_flush"'
and '"port": 19200' or a regex that allows any ordering) or keep the existing
checks but add a clear comment referencing the fake API's log() sort_keys=True
behavior and that the assertion expects alphabetic ordering (and also clarify
what the '"primaries"' and '"null"' assertions validate) so intent is explicit;
target the assertions on _fake_api_log and mention log()/sort_keys=True in the
comment or refactor the two lines that combine path and port into independent
assertions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bc10b5c3-0270-4780-8150-92c82c54e803
📒 Files selected for processing (17)
- .gitignore
- molecule/elasticsearch_default/molecule.yml
- molecule/elasticsearch_default/side_effect.yml
- roles/elasticsearch/defaults/main.yml
- roles/elasticsearch/handlers/main.yml
- roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml
- roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml
- roles/elasticsearch/tasks/main.yml
- roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling.yml
- roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml
- tests/fakes/fake_es_rolling_api.py
- tests/integration/elasticsearch_cluster_settings_contract.yml
- tests/integration/elasticsearch_template_contract.yml
- tests/integration/elasticsearch_template_inventory.ini
- tests/integration/elasticsearch_upgrade_detection_contract.yml
- tests/integration/rolling_restart_contract.yml
- tests/integration/rolling_restart_inventory.ini
        else hostvars[_elasticsearch_restart_host].elasticsearch_http_protocol
             | default(elasticsearch_http_protocol | default('http')) }}
    _elasticsearch_restart_port: "{{ hostvars[_elasticsearch_restart_host].elasticstack_elasticsearch_http_port | default(elasticstack_elasticsearch_http_port) }}"
    _elasticsearch_health_statuses: "{{ ['green'] if elasticsearch_config_restart_wait_status == 'green' else ['green', 'yellow'] }}"
Require green after re-enabling allocation.
_elasticsearch_health_statuses is reused for the post-restart gate, so with the default elasticsearch_config_restart_wait_status: yellow this can advance to the next node while the cluster is still yellow. Issue #121 explicitly requires waiting for green before moving on, otherwise the next restart can start with replicas still unassigned.
Suggested change
- _elasticsearch_health_statuses: "{{ ['green'] if elasticsearch_config_restart_wait_status == 'green' else ['green', 'yellow'] }}"
+ _elasticsearch_pre_restart_health_statuses: "{{ ['green'] if elasticsearch_config_restart_wait_status == 'green' else ['green', 'yellow'] }}"
+ _elasticsearch_post_restart_health_statuses: ['green']
@@
- - (_elasticsearch_pre_restart_health.json.status | default('')) in _elasticsearch_health_statuses
+ - (_elasticsearch_pre_restart_health.json.status | default('')) in _elasticsearch_pre_restart_health_statuses
@@
- - (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_health_statuses
+ - (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_post_restart_health_statuses
@@
- - (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_health_statuses
+ - (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_post_restart_health_statuses

Also applies to: 166-183, 243-261
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml`
at line 16, The current conditional for _elasticsearch_health_statuses uses
elasticsearch_config_restart_wait_status and may allow 'yellow' for the
post-restart gate; change the logic so that when re-enabling shard allocation
after a node restart the expected health is strictly ['green'] (i.e. set
_elasticsearch_health_statuses = ['green'] for the post-reenable step regardless
of elasticsearch_config_restart_wait_status), update the same logic sites that
reuse _elasticsearch_health_statuses (references: _elasticsearch_health_statuses
and elasticsearch_config_restart_wait_status) so the gate that runs after
re-enabling allocation always waits for green before proceeding.
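To make the intended gate concrete, a rough sketch of the post-reenable wait is shown below, using the split status variable proposed above. The URL layout, delegation target, and retry values are illustrative assumptions; the role builds these from its own per-host variables.

- name: Wait for cluster health after re-enabling allocation (post-reenable gate)
  ansible.builtin.uri:
    # Illustrative URL; the role derives protocol, host, and port from per-host variables
    url: "http://{{ _elasticsearch_cluster_api_host }}:9200/_cluster/health"
    return_content: true
  register: _elasticsearch_post_restart_health
  # Always require green here, independent of elasticsearch_config_restart_wait_status
  until: (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_post_restart_health_statuses
  retries: 60
  delay: 10
  vars:
    _elasticsearch_post_restart_health_statuses: ['green']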
Actionable comments posted: 2
♻️ Duplicate comments (1)
roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml (1)
16-16: ⚠️ Potential issue | 🟠 Major
Require `green` after allocation is re-enabled.

`_elasticsearch_health_statuses` is still reused for the post-restart gates, so a run with `elasticsearch_config_restart_wait_status: yellow` can proceed while replicas are still unassigned. That breaks the rolling-restart contract from issue #121 for the post-reenable phase.

Suggested fix
- _elasticsearch_health_statuses: "{{ ['green'] if elasticsearch_config_restart_wait_status == 'green' else ['green', 'yellow'] }}"
+ _elasticsearch_pre_restart_health_statuses: "{{ ['green'] if elasticsearch_config_restart_wait_status == 'green' else ['green', 'yellow'] }}"
+ _elasticsearch_post_restart_health_statuses: ['green']
@@
- - (_elasticsearch_pre_restart_health.json.status | default('')) in _elasticsearch_health_statuses
+ - (_elasticsearch_pre_restart_health.json.status | default('')) in _elasticsearch_pre_restart_health_statuses
@@
- - (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_health_statuses
+ - (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_post_restart_health_statuses
@@
- - (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_health_statuses
+ - (_elasticsearch_post_restart_health.json.status | default('')) in _elasticsearch_post_restart_health_statuses

Also applies to: 181-193, 258-270
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml` at line 16, The current variable _elasticsearch_health_statuses uses elasticsearch_config_restart_wait_status for both pre-restart and post-reenable checks, letting a run proceed with yellow even after shard allocation is re-enabled; change the post-reenable gate to require ['green'] unconditionally. Locate uses of _elasticsearch_health_statuses around the post-restart verification (references to _elasticsearch_health_statuses and elasticsearch_config_restart_wait_status) and either introduce a separate variable (e.g., _elasticsearch_health_statuses_post_reenable = ['green']) or override the value for the post-reenable checks so that after allocation is re-enabled the code always waits for ['green'].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@roles/elasticsearch/handlers/main.yml`:
- Around line 20-22: The Jinja conditional uses
groups[elasticstack_elasticsearch_group_name] without a local default, which can
fail when this role is used standalone; update every occurrence of
groups[elasticstack_elasticsearch_group_name] in the file (the three
conditionals matching the shown diff blocks) to use the same safe fallback as
elsewhere: groups[elasticstack_elasticsearch_group_name] | default([]) so the
expression becomes e.g. groups[elasticstack_elasticsearch_group_name] |
default([]) | length <= 1 (or the equivalent used in the first block), ensuring
the direct/rolling decision won't error when the group variable is missing.
In `@roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml`:
- Around line 35-39: The idempotency check in the Jinja loop (_needs_update) can
mis-detect boolean changes because it compares value | string to _current[key] |
string while Elasticsearch may return lowercase "true"/"false"; update the
comparison in the loop that references _elasticsearch_effective_cluster_settings
and _current so both sides are normalized (e.g., apply |string|lower to
_current[key] and value, or coerce both to boolean using |bool) before comparing
and keep the existing None check for _current.get(key); modify the conditional
that sets ns.changed to use this normalized comparison so boolean settings no
longer trigger repeated PUTs.
---
Duplicate comments:
In `@roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml`:
- Line 16: The current variable _elasticsearch_health_statuses uses
elasticsearch_config_restart_wait_status for both pre-restart and post-reenable
checks, letting a run proceed with yellow even after shard allocation is
re-enabled; change the post-reenable gate to require ['green'] unconditionally.
Locate uses of _elasticsearch_health_statuses around the post-restart
verification (references to _elasticsearch_health_statuses and
elasticsearch_config_restart_wait_status) and either introduce a separate
variable (e.g., _elasticsearch_health_statuses_post_reenable = ['green']) or
override the value for the post-reenable checks so that after allocation is
re-enabled the code always waits for ['green'].
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 451fc45c-53e9-4c53-8b5f-c35982551990
📒 Files selected for processing (8)
- molecule/elasticsearch_default/molecule.yml
- roles/elasticsearch/defaults/main.yml
- roles/elasticsearch/handlers/main.yml
- roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml
- roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml
- roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml
- tests/fakes/fake_es_rolling_api.py
- tests/integration/rolling_restart_contract.yml
✅ Files skipped from review due to trivial changes (1)
- tests/fakes/fake_es_rolling_api.py
🚧 Files skipped from review as they are similar to previous changes (3)
- molecule/elasticsearch_default/molecule.yml
- roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml
- roles/elasticsearch/defaults/main.yml
    _needs_update: >-
      {% set ns = namespace(changed=false) %}
      {% for key, value in _elasticsearch_effective_cluster_settings.items() %}
      {% if _current.get(key) is none or _current[key] | string != value | string %}
      {% set ns.changed = true %}
Boolean setting values can break idempotent diff detection.
On Line 38, string comparison can mis-detect changes when desired values are YAML booleans (True/False) but Elasticsearch returns lowercase strings (true/false), causing repeated PUTs.
Proposed fix
- _needs_update: >-
- {% set ns = namespace(changed=false) %}
- {% for key, value in _elasticsearch_effective_cluster_settings.items() %}
- {% if _current.get(key) is none or _current[key] | string != value | string %}
- {% set ns.changed = true %}
- {% endif %}
- {% endfor %}
- {{ ns.changed }}
+ _needs_update: >-
+ {% set ns = namespace(changed=false) %}
+ {% for key, value in _elasticsearch_effective_cluster_settings.items() %}
+ {% if value is boolean %}
+ {% set expected = 'true' if value else 'false' %}
+ {% else %}
+ {% set expected = value | string %}
+ {% endif %}
+ {% if _current.get(key) is none or (_current[key] | string) != expected %}
+ {% set ns.changed = true %}
+ {% endif %}
+ {% endfor %}
+ {{ ns.changed }}As per coding guidelines, "Review for Ansible best practices: idempotency (command/shell needs creates/removes guards), correct handler notifications, proper use of become/become_user, and platform conditionals (Debian vs RHEL)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml` around lines 35
- 39, The idempotency check in the Jinja loop (_needs_update) can mis-detect
boolean changes because it compares value | string to _current[key] | string
while Elasticsearch may return lowercase "true"/"false"; update the comparison
in the loop that references _elasticsearch_effective_cluster_settings and
_current so both sides are normalized (e.g., apply |string|lower to
_current[key] and value, or coerce both to boolean using |bool) before comparing
and keep the existing None check for _current.get(key); modify the conditional
that sets ns.changed to use this normalized comparison so boolean settings no
longer trigger repeated PUTs.
We drop a local /home/sam path that snuck into the Molecule env, guard the handler and task expressions that dereference sentinel facts so standalone role use won't trip an undefined-variable error, and default the post-restart wait to green to match the issue spec and upstream rolling-upgrade docs. The cleanup in the always: block previously delegated back to the node we had just tried to restart, so if that node was the one failing we'd lose our chance to re-enable shard allocation on the rest of the cluster. It now picks a surviving peer from the elasticsearch group. The contract test grew a new scenario that forces a rejoin failure on es1 and asserts the cleanup PUT lands on es2.
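A rough sketch of that surviving-peer selection, assuming the inventory group is literally named `elasticsearch` and that the cleanup call uses the uri module; the real task names and URL construction in the role may differ.

- name: Pick a surviving peer for cleanup API calls
  ansible.builtin.set_fact:
    _elasticsearch_cluster_api_host: >-
      {{ (groups['elasticsearch'] | default([])
          | difference([_elasticsearch_restart_host])
          | first)
         | default(_elasticsearch_restart_host) }}

- name: Re-enable shard allocation via the surviving peer (cleanup in the always block)
  ansible.builtin.uri:
    url: "http://{{ _elasticsearch_cluster_api_host }}:9200/_cluster/settings"
    method: PUT
    body_format: json
    body:
      persistent:
        cluster.routing.allocation.enable: null
  delegate_to: "{{ _elasticsearch_cluster_api_host }}"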
25f7be9 to 3ab5c67
♻️ Duplicate comments (3)
roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml (1)
35-39: ⚠️ Potential issue | 🟠 Major
Normalize booleans before comparison to preserve idempotency.
Line 38 still compares raw string forms; YAML booleans (True/False) can differ from Elasticsearch-returned lowercase values (true/false), causing repeated PUTs.

Suggested fix
  _needs_update: >-
    {% set ns = namespace(changed=false) %}
    {% for key, value in _elasticsearch_effective_cluster_settings.items() %}
-   {% if _current.get(key) is none or _current[key] | string != value | string %}
+   {% if value is boolean %}
+   {% set expected = 'true' if value else 'false' %}
+   {% else %}
+   {% set expected = value | string %}
+   {% endif %}
+   {% if _current.get(key) is none or (_current[key] | string) != expected %}
    {% set ns.changed = true %}
    {% endif %}
    {% endfor %}
    {{ ns.changed }}

As per coding guidelines, "Review for Ansible best practices: idempotency (command/shell needs creates/removes guards), correct handler notifications, proper use of become/become_user, and platform conditionals (Debian vs RHEL)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml` around lines 35 - 39, The comparison inside _needs_update can misdetect changes because YAML booleans may be True/False while Elasticsearch returns lowercase true/false; update the loop comparing _elasticsearch_effective_cluster_settings and _current so both sides are normalized before comparing: for each key use a normalized form like (value is boolean ? (value | string | lower) : (value | string)) and the same normalization for _current[key] (check _current.get(key) is defined first), then set ns.changed = true only if the normalized strings differ; this targets the existing symbols _needs_update, _elasticsearch_effective_cluster_settings, _current and ns.changed.
roles/elasticsearch/handlers/main.yml (1)

20-22: ⚠️ Potential issue | 🟠 Major
Default the group-name variable before indexing groups[...].

These conditions still index `groups[elasticstack_elasticsearch_group_name]` directly. If this role is used standalone and that variable is unset, the expression can fail before the trailing `| default([])` is applied, so the direct/rolling split crashes instead of falling back safely.

Suggested hardening
- or groups[elasticstack_elasticsearch_group_name] | default([]) | length <= 1
+ or groups[elasticstack_elasticsearch_group_name | default('elasticsearch')] | default([]) | length <= 1
...
- or groups[elasticstack_elasticsearch_group_name] | default([]) | length <= 1
+ or groups[elasticstack_elasticsearch_group_name | default('elasticsearch')] | default([]) | length <= 1
...
- - groups[elasticstack_elasticsearch_group_name] | default([]) | length > 1
+ - groups[elasticstack_elasticsearch_group_name | default('elasticsearch')] | default([]) | length > 1

Also applies to: 34-36, 48-49
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@roles/elasticsearch/handlers/main.yml` around lines 20 - 22, The condition indexes groups[elasticstack_elasticsearch_group_name] directly which can raise if elasticstack_elasticsearch_group_name is undefined; update all such expressions (the conditional using elasticsearch_config_restart_strategy and the other occurrences noted at 34-36 and 48-49) to safely default before indexing, e.g. use groups.get(elasticstack_elasticsearch_group_name, []) | default([]) or groups.get(elasticstack_elasticsearch_group_name, []) | length where appropriate so the expression never attempts to index with an undefined variable; apply this change to every occurrence of groups[elasticstack_elasticsearch_group_name] in the handler file.
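Putting the guard patterns from this comment and the earlier handler comment together, a handler condition could look roughly like the sketch below. The handler name and notify wiring are illustrative; the variable names come from the review comments, and the group-size check mirrors the existing direct-versus-rolling decision.

- name: Restart Elasticsearch
  ansible.builtin.service:
    name: elasticsearch
    state: restarted
  when:
    # Never read .changed from a registered variable that may be undefined
    - not (_elasticsearch_freshstart.changed | default(false) | bool)
    - not (_elasticsearch_freshstart_security.changed | default(false) | bool)
    # Default both the group-name variable and the group itself, so standalone use cannot fail;
    # a direct restart is only chosen when the group has at most one member
    - groups[elasticstack_elasticsearch_group_name | default('elasticsearch')] | default([]) | length <= 1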
roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml (1)

16-16: ⚠️ Potential issue | 🟠 Major
Require `green` after re-enabling allocation.

This status list is reused by the post-restart health checks, so `elasticsearch_config_restart_wait_status: yellow` still allows the workflow to advance while the cluster is yellow. Issue #121 requires the post-allocation gate to wait for `green` before moving to the next node.

Suggested fix
- _elasticsearch_health_statuses: "{{ ['green'] if elasticsearch_config_restart_wait_status == 'green' else ['green', 'yellow'] }}"
+ _elasticsearch_pre_restart_health_statuses: "{{ ['green'] if elasticsearch_config_restart_wait_status == 'green' else ['green', 'yellow'] }}"
+ _elasticsearch_post_restart_health_statuses: ['green']
Then use `_elasticsearch_pre_restart_health_statuses` for the pre-restart wait, and `_elasticsearch_post_restart_health_statuses` for the post-reenable waits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml` at line 16, The current single variable _elasticsearch_health_statuses allows a yellow cluster to pass post-restart checks; split it into two variables and update usages: create _elasticsearch_pre_restart_health_statuses = "{{ ['green'] if elasticsearch_config_restart_wait_status == 'green' else ['green','yellow'] }}" and _elasticsearch_post_restart_health_statuses = "['green']" and replace references that run before the restart with _elasticsearch_pre_restart_health_statuses and references that run after re-enabling allocation with _elasticsearch_post_restart_health_statuses so the post-allocation gate always requires green.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@roles/elasticsearch/handlers/main.yml`:
- Around line 20-22: The condition indexes
groups[elasticstack_elasticsearch_group_name] directly which can raise if
elasticstack_elasticsearch_group_name is undefined; update all such expressions
(the conditional using elasticsearch_config_restart_strategy and the other
occurrences noted at 34-36 and 48-49) to safely default before indexing, e.g.
use groups.get(elasticstack_elasticsearch_group_name, []) | default([]) or
groups.get(elasticstack_elasticsearch_group_name, []) | length where appropriate
so the expression never attempts to index with an undefined variable; apply this
change to every occurrence of groups[elasticstack_elasticsearch_group_name] in
the handler file.
In `@roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml`:
- Around line 35-39: The comparison inside _needs_update can misdetect changes
because YAML booleans may be True/False while Elasticsearch returns lowercase
true/false; update the loop comparing _elasticsearch_effective_cluster_settings
and _current so both sides are normalized before comparing: for each key use a
normalized form like (value is boolean ? (value | string | lower) : (value |
string)) and the same normalization for _current[key] (check _current.get(key)
is defined first), then set ns.changed = true only if the normalized strings
differ; this targets the existing symbols _needs_update,
_elasticsearch_effective_cluster_settings, _current and ns.changed.
In `@roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml`:
- Line 16: The current single variable _elasticsearch_health_statuses allows a
yellow cluster to pass post-restart checks; split it into two variables and
update usages: create _elasticsearch_pre_restart_health_statuses = "{{ ['green']
if elasticsearch_config_restart_wait_status == 'green' else ['green','yellow']
}}" and _elasticsearch_post_restart_health_statuses = "['green']" and replace
references that run before the restart with
_elasticsearch_pre_restart_health_statuses and references that run after
re-enabling allocation with _elasticsearch_post_restart_health_statuses so the
post-allocation gate always requires green.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8aaad5e7-0a56-4d9c-9cbf-f8f6cf343b19
📒 Files selected for processing (17)
- .gitignore
- molecule/elasticsearch_default/molecule.yml
- molecule/elasticsearch_default/side_effect.yml
- roles/elasticsearch/defaults/main.yml
- roles/elasticsearch/handlers/main.yml
- roles/elasticsearch/tasks/elasticsearch-cluster-settings.yml
- roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml
- roles/elasticsearch/tasks/main.yml
- roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling.yml
- roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling_node.yml
- tests/fakes/fake_es_rolling_api.py
- tests/integration/elasticsearch_cluster_settings_contract.yml
- tests/integration/elasticsearch_template_contract.yml
- tests/integration/elasticsearch_template_inventory.ini
- tests/integration/elasticsearch_upgrade_detection_contract.yml
- tests/integration/rolling_restart_contract.yml
- tests/integration/rolling_restart_inventory.ini
✅ Files skipped from review due to trivial changes (8)
- .gitignore
- tests/integration/rolling_restart_inventory.ini
- tests/integration/elasticsearch_template_inventory.ini
- roles/elasticsearch/defaults/main.yml
- tests/integration/elasticsearch_upgrade_detection_contract.yml
- tests/integration/elasticsearch_template_contract.yml
- tests/fakes/fake_es_rolling_api.py
- molecule/elasticsearch_default/side_effect.yml
🚧 Files skipped from review as they are similar to previous changes (4)
- molecule/elasticsearch_default/molecule.yml
- roles/elasticsearch/tasks/restart_and_verify_elasticsearch_rolling.yml
- roles/elasticsearch/tasks/main.yml
- roles/elasticsearch/tasks/elasticsearch-upgrade-detection.yml
…olecule

The provisioner env overrode ANSIBLE_COLLECTIONS_PATH with a path that doesn't include the runner's temp collections dir, so ansible couldn't resolve the oddly.elasticstack.repos role under self-hosted CI. Every other scenario inherits the env correctly; this one had a leftover from local dev. Dropping it restores the CI-provided path.
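For reference, a hedged sketch of what the scenario's provisioner section might look like after that cleanup; `provisioner.env` is standard Molecule configuration, but the exact keys this scenario keeps are an assumption. If a collections path ever has to be pinned, the env-default pattern from the earlier review comment avoids hardcoding a home directory.

provisioner:
  name: ansible
  env:
    # Illustrative key; the important part is what is absent: no hardcoded
    # ANSIBLE_COLLECTIONS_PATH, so the runner-provided path (including its temp
    # collections dir) is used as-is and oddly.elasticstack.repos resolves in CI.
    ANSIBLE_FORCE_COLOR: "true"
    # If an override is ever needed, prefer an env-default over a fixed user home:
    # ANSIBLE_COLLECTIONS_PATH: "${ANSIBLE_COLLECTIONS_PATH:-${HOME}/.ansible/collections:/usr/share/ansible/collections}"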
Fixes #121. Adds health-gated rolling restarts for Elasticsearch config changes plus contract and Molecule coverage.
Summary by CodeRabbit
New Features
Tests
Chores